This report on the answers of the exercises outline Activity 2 on Data transformation all done in R Studio.
Let’s load the prerequisite libraries, we have assumes that packages has been installed.
library("nycflights13")
library("tidyverse")
## -- Attaching packages ---------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Checking data
?flights
## starting httpd help server ... done
In order to find flights with an arrival delay of two or more ours, we should covert hours into minutes Since the arr_delay variable is measured in minutes. Therefore, the check becomes 120 or more minutes.
filter(flights, arr_delay >= 120)
We are looking into those flights where the destination (dest) is either “IAH” or “HOU”.
THere are two ways to filter it. It shows same answer but the first is more concise.
First:
filter(flights,dest %in% c("IAH", "HOU"))
Second:
filter(flights, dest == "IAH" | dest == "HOU")
Let’s check first what are the airline code for United, American or Delta.
airlines
We can now see that UA is for United Airlines, AA is for American Airlines, and DL is for Delta.
filter(flights, carrier %in% c("AA", "DL", "UA"))
MOnths representation are in number. July is 7, August is 8 and soon.
filter(flights, month >= 7, month <= 9)
using some mathematical operators, we can run:
filter(flights, arr_delay > 120, dep_delay <= 0)
Let’s analyze:
If a flight was delayed by at least an hour, then dep_delay >= 60. If the flight didn’t make up any time in the air, then its arrival would be delayed by the same amount as its departure, meaning dep_delay == arr_delay, or alternatively, dep_delay - arr_delay == 0. If it makes up over 30 minutes in the air, then the arrival delay must be at least 30 minutes less than the departure delay, which is stated as dep_delay - arr_delay > 30.
filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
How does midnight is depicted here? This is a riddle so we need to check summary.
summary(flights$dep_time)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 907 1401 1349 1744 2400 8255
Viola midnight is 2400, not 0 as first thought.
filter(flights, dep_time <= 600 | dep_time == 2400)
filter(flights, between(month, 7, 9))
I noticed that the expression between(x, left, right) is equivalent to x >= left & x <= right.
Find the rows of flights with a missing departure time (dep_time) using the is.na() function.
filter(flights, is.na(dep_time))
IT looks like these missing departure times are those canceled flights.
Let’s check:
NA ^ 0
## [1] 1
NA ^ 0 == 1 since for all numeric values x0=1 .
NA | TRUE
## [1] TRUE
NA | TRUE is TRUE because anything or TRUE is TRUE. By using OR operator, if the missing value were TRUE, then TRUE | TRUE == TRUE, and if the missing value was FALSE, then FALSE | TRUE == TRUE.
NA & FALSE
## [1] FALSE
The value of NA & FALSE is FALSE because anything and FALSE is always FALSE. If the missing value were TRUE, then TRUE & FALSE == FALSE, and if the missing value was FALSE, then FALSE & FALSE == FALSE.
NA | FALSE
## [1] NA
For NA | FALSE, the value is unknown since TRUE | FALSE == TRUE, but FALSE | FALSE == FALSE.
NA & FALSE
## [1] FALSE
For NA & TRUE, the value is unknown since FALSE & TRUE== FALSE, but TRUE & TRUE == TRUE.
NA * 0
## [1] NA
Since x∗0=0 for all finite numbers we might expect NA * 0 == 0, but that’s not the case. The reason that NA * 0 != 0 is that 0×∞ and 0×−∞
are undefined. R represents undefined results as NaN, which is an abbreviation of “not a number”.
Inf * 0
## [1] NaN
-Inf * 0
## [1] NaN
Using arrange() in default call,
arrange(flights, dep_time)
Let’s look where are the NAs, it looks like they are the end of the list.
arrange(flights, dep_time) %>%
tail()
So let;s use is.na() to put all these NAs at the start of list for sorting.
arrange(flights, desc(is.na(dep_time)), dep_time)
The flights will first be sorted by desc(is.na(dep_time)) because desc(is.na(dep_time)) function is either TRUE when dep_time is missing, or FALSE, when it is not, the rows with missing values of dep_time will come first, since TRUE > FALSE.
To find the most delayed flights, we will sort the table using departure delay, dep_delay, in descending order.
arrange(flights, desc(dep_delay))
The most delayed flights were scheduled to leave on January 09, 2013 09:00. Noticed that the departure time is given as 641, which is 641AM the next day, so around 21 hours and 41 minutes delayed. Curious to see what airlines are the most delayed - it was HA 51, JFK to HNL,
For the earliest departing flight, it can be found by sorting dep_delay in ascending order.
arrange(flights, dep_delay)
The earliest flights were noted to depart December 7, 2013 with dep_time 2040, a 43 minutes early than its scheduled dept time of 2123.
I think to have the fastest flights, it should have the shortest air_time. Adding head() is to ensure it will be visible.
head(arrange(flights, (air_time)))
Fastest flights here are those with 20 minute flights fromEWR to BDL.
Or, we can compute to get the ground speed by dividing distance over air_time.
arrange(flights, desc(distance/air_time))
Based on ground speed, DL 1499 from LGA to ATL is the fastest.
Sort the flights by the distance column in descending order to get the farthest flight.
arrange(flights, desc(distance))
In the data, it looks like the flight going to HNL from JFK is the farthest.
arrange(flights, air_time)
Flight UV 4368 from EWR and BDL is the shortest.
In selecting columns, I think we have few ways to do it.
select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, 4, 6, 7, 9)
This works, but is not good practice for two reasons. First, the column location of variables may change, resulting in code that may continue to run without error, but produce the wrong answer. Second code is obfuscated, since it is not clear from the code which variables are being selected. What variable does column 6 correspond to? I just wrote the code, and I’ve already forgotten.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, all_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
select(flights, any_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
This is useful because the names of the variables can be stored in a variable and passed to all_of() or any_of().
variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, all_of(variables))
These two functions replace the deprecated function one_of().
Selecting the variables by matching the start of their names using starts_with().
select(flights, starts_with("dep_"), starts_with("arr_"))
Selecting the variables using regular expressions with matches(). Regular expressions provide a flexible way to match string patterns and are discussed in the Strings chapter.
select(flights, matches("^(dep|arr)_(time|delay)$"))
Specify the names of the variables with a character vector and use the bang-bang operator (!!).
variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, !!variables)
This and the following answers use the features of tidy evaluation not covered in R4DS but covered in the Programming with dplyr vignette.
Specify the names of the variables in a character or list vector and use the bang-bang-bang operator.
variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, !!!variables)
Specify the unquoted names of the variables in a list using syms() and use the bang-bang-bang operator.
variables <- syms(c("dep_time", "dep_delay", "arr_time", "arr_delay"))
select(flights, !!!variables)
There are Some things that don’t work:
Matching the ends of their names using ends_with() since this will incorrectly include other variables. For example,
select(flights, ends_with("arr_time"), ends_with("dep_time"))
Matching the names using contains() since there is not a pattern that can include all these variables without incorrectly including others.
select(flights, contains("_time"), contains("arr_"))
select(flights, dep_time, arr_time, carrier, dep_time, dep_time)
It looks like it ignores the duplicate variables in displaying. We need not see any error or warnings. Just it let it pass.
select(flights, everything())
This behavior is useful because it means that we can use select() with everything() in order to easily change the order of columns without having to specify the names of all the columns.
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
Per documentation,
The one_of() function selects variables with a character vector rather than unquoted variable name arguments. This function is useful because it is easier to programmatically generate character vectors with variable names than to generate unquoted variable names, which are easier to type.
select(flights, one_of(vars))
There are also other similar functions: all_of() and any_of()
select(flights, all_of(vars))
select(flights, any_of(vars))
These functions differ in their strictness. The function all_of() will raise an error if one of the variable names is not present, while any_of() will ignore it.
The select() function can now accept the name of a vector containing the variable names you wish to select:
select(flights, vars)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(vars)` instead of `vars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
However there is a problem with the previous code. The name vars could refer to a column named vars in flights or a different variable named vars. What th code does will depend on whether or not vars is a column in flights. If vars was a column in flights, then that code would only select the vars column. For example:
flights <- mutate(flights, vars = 1)
select(flights, !!!vars)
This behavior, which is used by many tidyverse functions, is an example of what is called non-standard evaluation (NSE) in R. See the dplyr vignette, Programming with dplyr, for more information on this topic.
select(flights, contains("TIME"))